In [1]:
%pylab inline
In [2]:
import numpy as np
import pandas as pd
import networkx as nx
from tethne.readers import wos
import igraph
import nltk
from collections import Counter
from tethne import Corpus
from helpers import extract_keywords, filter_token, normalize_token
In this workbook we will conduct a co-citation analysis using the approach outlined in Chen (2009). If you have used the Java-based desktop application CiteSpace II, this should be familiar: this is the same methodology that is implemented in that application.
Co-citation analysis gained popularity in the 1970s as a technique for “mapping” scientific literatures, and for finding latent semantic relationships among technical publications.
Two papers are co-cited if they are both cited by the same third paper. The standard approach to co-citation analysis is to generate a sample of bibliographic records from a particular field by using certain keywords or journal names, and then build a co-citation graph describing relationships among their cited references. Thus the majority of papers that are represented as nodes in the co-citation graph are not papers that responded to the selection criteria used to build the dataset.
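As a toy illustration of the idea (not Tethne's implementation), co-citation counts can be accumulated by looking at every pair of references in each citing paper's reference list. The papers and reference lists below are entirely made up.

# Hypothetical reference lists for three citing papers.
from itertools import combinations
from collections import Counter

references = {
    'paper1': ['Smith 1990', 'Jones 1995', 'Lee 2001'],
    'paper2': ['Smith 1990', 'Jones 1995'],
    'paper3': ['Jones 1995', 'Lee 2001'],
}

# Two references are co-cited once for every paper that cites both of them.
cocitations = Counter()
for refs in references.values():
    for pair in combinations(sorted(refs), 2):
        cocitations[pair] += 1

print cocitations.most_common(3)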
Our objective in this tutorial is to identify papers that bridge the gap between otherwise disparate areas of knowledge in the scientific literature. In this tutorial, we rely on the theoretical framework described in Chen (2006) and Chen et al. (2009).
According to Chen, we can detect potentially transformative changes in scientific knowledge by looking for cited references that both (a) rapidly accrue citations, and (b) have high betweenness centrality in a co-citation network. It helps if we think of each scientific paper as representing a "concept" (its core knowledge claim, perhaps), and a co-citation event as representing a proposition connecting two concepts in the knowledge-base of a scientific field. If a new paper emerges that is highly co-cited with two otherwise-distinct clusters of concepts, then that might mean that the field is adopting new concepts and propositions in a way that is structurally radical for its conceptual framework.
Chen (2009) introduces sigma ($\Sigma$) as a metric for potentially transformative cited references:
$$ \Sigma(v) = (g(v) + 1)^{\mathrm{burstness}(v)} $$

...where the betweenness centrality of each node $v$ is:

$$ g(v) = \sum\limits_{i\neq j\neq v} \frac{\sigma_{ij}(v)}{\sigma_{ij}} $$

...where $\sigma_{ij}$ is the number of shortest paths from node $i$ to node $j$, and $\sigma_{ij}(v)$ is the number of those paths that pass through $v$. Burstness (normalized to $[0, 1]$) is estimated using Kleinberg's (2002) automaton model, and is designed to detect rate spikes around features in a stream of documents.
Note: In this notebook we will not use burstness, but rather the relative increase/decrease in citations from one year to the next. Burstness is helpful when we are dealing with higher-resolution time-frames, and/or we want to monitor a long stream of citation data. Since we will smooth our data with a multi-year time-window, burstness becomes a bit less informative, and the year-over-year change in citations (we'll call this Delta $\Delta$) is an intuitive alternative.
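To make the pieces of $\Sigma$ concrete, here is a minimal sketch that computes $g(v)$ for a small made-up co-citation graph with NetworkX, and combines it with a hand-picked year-over-year change in citation counts standing in for $\Delta(v)$. The graph, the node labels, and the citation counts are all invented for illustration; this is not part of the analysis below.

# A made-up co-citation graph: two triangles bridged by node 'E'.
toy = nx.Graph()
toy.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'C'),    # cluster 1
                    ('D', 'F'), ('F', 'H'), ('D', 'H'),    # cluster 2
                    ('C', 'E'), ('E', 'D')])               # 'E' bridges them

g = nx.betweenness_centrality(toy)    # g(v) for every node

# Hypothetical citation counts for 'E' in two consecutive years.
N_last, N_now = 4., 10.
delta = (N_now - N_last) / max(N_last, 1.)    # Delta(v) = 1.5

sigma_E = (g['E'] + 1.) ** delta
print g['E'], delta, sigma_E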
Here we have some field-tagged data from the Web of Science. We set streaming=True so that we don't load everything into memory all at once.

Note: When we stream the corpus, it is important to set index_fields and index_features ahead of time, so that we don't have to iterate over the whole corpus later on.
In [43]:
metadata = wos.read('../data/Baldwin/PlantPhysiology',
                    streaming=True, index_fields=['date', 'abstract'],
                    index_features=['citations'])
In [44]:
len(metadata)
Out[44]:
In [5]:
from tethne import cocitation
Co-citation graphs can get enormous quickly, and so it is important to set a threshold number of times that a paper must be cited to be included in the graph (min_weight). It's better to start high, and bring the threshold down as needed.

Note that edge_attrs should be set to whatever was the value of index_fields when we used read() (above).
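If you're not sure where to set the threshold, one way to calibrate is to build the graph at a few decreasing values of min_weight and watch how quickly it grows. This is just a sketch: each call rebuilds the graph from the corpus, which can be slow for a large dataset.

# Probe a few thresholds to see how quickly the graph grows.
for weight in (10., 8., 6.):
    probe = cocitation(metadata, min_weight=weight, edge_attrs=[])
    print weight, probe.order(), probe.size()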
In [6]:
graph = cocitation(metadata, min_weight=6., edge_attrs=['date'])
In [7]:
graph.order(), graph.size(), nx.number_connected_components(graph)
Out[7]:
In [8]:
nx.write_graphml(graph, 'cocitation.graphml')
Chen (2009) proposed sigma ($\Sigma$) as a metric for potential transformations in a scientific literature:

$$ \Sigma(v) = (g(v) + 1)^{\mathrm{burstness}(v)} $$

As noted above, we will use the year-over-year change in citations, Delta ($\Delta(v)$), in place of burstness. So:

$$ \Sigma(v) = (g(v) + 1)^{\Delta(v)} $$

$$ \Delta(v) = \frac{N_t(v) - N_{t-1}(v)}{\max(1,\; N_{t-1}(v))} $$

...where $N_t(v)$ is the number of times $v$ is cited in year $t$.

Since we are interested in the evolution of the co-citation graph over time, we need to create a series of sequential graphs. Tethne provides a class called GraphCollection that will do this for us.
We pass metadata (or a Corpus object), the cocitation function, and then some configuration information:

- slice_kwargs controls how the sequential time-slices are generated. The default is to use 1-year slices, and advance 1 year per slice. We are stating here that we want to extract only the citations feature from each slice (for performance).
- method_kwargs controls the graph-building function. Here we pass our min_weight, and we also say that we don't want any attributes on the edges (for performance).
In [9]:
from tethne import GraphCollection
G = GraphCollection(metadata, cocitation,
                    slice_kwargs={'feature_name': 'citations'},
                    method_kwargs={'min_weight': 3, 'edge_attrs': []})
In [12]:
for year, graph in G.iteritems():
    print graph.order(), graph.size(), nx.number_connected_components(graph)
    nx.write_graphml(graph, 'cocitation_%i.graphml' % year)
In [13]:
# 'betweenness_centrality' is the name of the algorithm
# in NetworkX that we want to use. ``invert=True`` means
# that we want to organize the g(v) values by node, rather
# than by time-period (the default).
g_v = G.analyze('betweenness_centrality', invert=True)
In [26]:
g_v.items()[89]
Out[26]:
In order to calculate $\Sigma$ more efficiently, we'll organize our data about the graph in a DataFrame. Our DataFrame will have the following columns:

- ID: the node's integer index in the GraphCollection
- Node: the cited reference that the node represents
- Year: the time-slice of the observation
- Citations: the number of citations to the node in that year
- Centrality: the node's betweenness centrality, $g(v)$, in that year
- Delta: the year-over-year change in citations, $\Delta(v)$

This may take a bit, depending on the size of the graphs.
In [23]:
node_data = pd.DataFrame(columns=['ID', 'Node', 'Year', 'Citations', 'Centrality', 'Delta'])
i = 0
for n in G.node_index.keys():
    if n < 0:
        continue

    # node_history() gets the values of a node attribute over
    # all of the graphs in the GraphCollection.
    g_n = G.node_history(n, 'betweenness_centrality')
    N_n = G.node_history(n, 'count')

    # Skip nodes whose g(v) never gets above 0.
    if max(g_n.values()) == 0:
        continue

    years = sorted(G.keys())    # Graphs are keyed by year.
    for year in years:
        g_nt = g_n.get(year, 0.0)            # Centrality for this year.
        N_nt = float(N_n.get(year, 0.0))     # Citations for this year.

        # For the second year and beyond, calculate Delta.
        if year > years[0]:
            N_nlast = N_n.get(year-1, 0.0)
            delta = (N_nt - N_nlast)/max(N_nlast, 1.)
        else:
            delta = 0.0

        # We will add one row per node per year.
        node_data.loc[i] = [n, G.node_index[n], year, N_nt, g_nt, delta]
        i += 1
That was fairly computationally expensive. We should save the results so that we don't have to do that again.
In [29]:
node_data.to_csv('node_data.csv', encoding='utf-8')
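If we come back to this later, we can reload the saved table rather than rebuilding it from the GraphCollection. A minimal sketch, assuming the index column written by to_csv() above is the first, unnamed column:

# Reload the saved node data, using the first column as the index.
node_data = pd.read_csv('node_data.csv', index_col=0, encoding='utf-8')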
Before calculating $\Sigma$ we will select a subset of the rows in our DataFrame, to reduce the computational burden. Here we create a smaller DataFrame with only those rows in which both Centrality and Delta are positive; the excluded rows would all have $\Sigma \leq 1$, so they cannot be interesting candidates anyway.
In [30]:
# Note the ``.copy()`` at the end -- this means that the new DataFrame will be a
# stand-alone copy, and not just a "view" of the existing ``node_data`` DataFrame.
# The practical effect is that we can add new data to the new ``candidates``
# DataFrame without creating problems in the larger ``node_data`` DataFrame.
candidates = node_data[node_data.Centrality*node_data.Delta > 0.].copy()
Now we calculate $\Sigma$. Vector math is great!
In [31]:
candidates['Sigma'] = (1.+candidates.Centrality)**candidates.Delta
The nodes (in a given year) with the highest $\Sigma$ are our candidates for potential "transformations". Note that in Chen's model, the cited reference and its co-cited references are an emission of the "knowledge" of the field. In other words, the nodes and edges in our graph primarily say something about the records that are doing the citing, rather than the records that are cited. I.e. a node with high $\Sigma$ is indicative of a transformation, but that is a description of the papers that cite it, and not necessarily of the paper that the node represents.
In [32]:
candidates.sort('Sigma', ascending=False)
Out[32]:
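If we want to see the strongest candidates within each time-slice rather than overall, one option is to group by year and keep the top few rows per group. This is just a sketch using ordinary pandas grouping; the choice of three rows per year is arbitrary.

# Top three candidates per year, by Sigma.
top_per_year = (candidates.sort('Sigma', ascending=False)
                          .groupby('Year')
                          .head(3))
top_per_year.sort('Year')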
In [24]:
clusters = pd.read_csv('clusters.dat', sep='\t', skiprows=9)
In [25]:
clusters
Out[25]:
In [33]:
clusters['Node IDs'][0].split(', ')
Out[33]:
In [34]:
citations = metadata.features['citations']
In [35]:
# Count, for each citing paper, how many of this cluster's references it cites.
citing = Counter()
for reference in clusters['Node IDs'][0].split(', '):
    for idx in citations.papers_containing(reference):
        citing[idx] += 1.

# Keep only the papers that cite more than two of the cluster's references.
chunk = [idx for idx, value in citing.items() if value > 2.]
This step can take a few minutes.
In [48]:
abstracts = {}
for abstract, wosid in metadata.indices['abstract'].iteritems():
    print '\r', wosid[0],
    abstracts[wosid[0]] = abstract
In [49]:
abstracts.items()[5]
Out[49]:
In [85]:
document_token_counts = nltk.ConditionalFreqDist([
    (wosid, normalize_token(token))
    for wosid, abstract in abstracts.items()
    for token in nltk.word_tokenize(abstract)
    if filter_token(token)
])
In [91]:
extract_keywords(document_token_counts, lambda k: k in chunk)
Out[91]:
In [93]:
cluster_keywords = {}
for i, row in clusters.iterrows():
    citing = Counter()
    for reference in row['Node IDs'].split(', '):
        for idx in citations.papers_containing(reference):
            citing[idx] += 1.
    chunk = [idx for idx, value in citing.items() if value > 2.]
    cluster_keywords[row.Cluster] = extract_keywords(document_token_counts, lambda k: k in chunk)
In [94]:
cluster_keywords
Out[94]: